Abstract

This short paper describes a systems biology software tool that can engage in a dialogue with a biologist
by responding to questions posed in English (or another natural language) about the behavior of a
complex biological system, and by suggesting a set of “facts” about that system based on a time-tested
“generate and test” approach. The tool thus improves the quality of the interaction that a biologist can
have with a system built on rigorous mathematical modeling, without requiring the biologist to be aware
of the underlying mathematical concepts or notations. Because our Simpathica/XSSYS tool has a precise
mathematical semantics, it was possible to construct a well-founded natural language interface on top of
the computational kernel. We discuss the tool and illustrate its use with a few examples. The natural
language subsystem is available both as an integrated subsystem of the Simpathica/XSSYS tool and through
a simple Web-based interface; we describe both in the paper. More details about the system can be found
at http://bioinformatics.nyu.edu and its sub-pages.


Introduction 

Many  biologists  face  the  hurdle  of  interacting  with  bioinformatics  analysis  tools  that 
require  mathematical  sophistication  and  training.  For  example,  drawing  qualitative 
conclusions  from  time-course  experimental  data  and  simulated  traces  of  mathematical 
models  involves  manually  examining  the  data  plots  -  possibly  generated  from  differential 
or  stochastic  models  -  which  are  often  fitted  to  actual  experimental  observations  by 
means  of  involved  statistical  filtering  procedures.  As  the  number  of  traces  and  the 
amount  of  quantitative  data  increase,  and  their  relationships  become  more  intricate,  this 
process  not  only  becomes  exceedingly  time-consuming,  but  also  bewilderingly  complex. 
In  addition,  the  process  is  further  complicated  by  the  care  needed  to  avoid  false 
inferences (either positive, negative, or both) when interpreting experimental data that is
corrupted by highly correlated stochastic noise processes, a problem that worsens with
dimension. Unfortunately, this is true of all currently available experimental datasets
dealing  with  biological  phenomena,  e.g.,  microarray  time-course  experiments  and  models 
of  complex  biological  systems,  as  they  usually  involve  a  large  number  of  experimental 
conditions  that  are  inter-related  with  one  another.  To  address  these  problems,  we  devised 
the  Simpathica/XSSYS  Trace  Analysis  Tool,  a  bioinformatics  system  that  enables  users 
to  query  these  datasets  qualitatively  using  a  propositional  temporal  logic. 

Alas, our solution to the problem of complex data analysis introduces one more layer
that requires specialized training: hypotheses must be formulated in
temporal logic. Therefore, to make the system accessible to biologists, we have now
integrated  a  natural  language  query  subsystem  within  the  Simpathica/XSSYS  Trace 
Analysis  Tool.  In  the  following  we  describe  our  approach  and  give  a  few  examples  of  its 
use.  Finally,  as  an  interesting  avenue  of  exploration  we  also  describe  a  prototype 
implementation  of  a  “story  generation”  system  based  on  a  restricted  exploration  of  the 
satisfiability  of  temporal  logic  sentences  over  a  set  of  (simulated)  traces  of  a  biological 
system. 


Figure 1. The Simpathica Main Window. The system being analyzed is the “repressilator” circuit
(EL00). The top left frame contains a list of the reactants. The bottom left frame is used to insert
different kinds of reactions selected from a list of known reactions. Finally, the right frame contains a
depiction of the reaction network.


Description

The Simpathica/XSSYS Trace Analysis Tool (APP+03) uses a branching-time
propositional temporal logic (E90) to formulate queries about the evolution of a
biological system. Temporal logic (TL), also called tense logic, is a modal logic that
incorporates special operators, or modes, that have a “temporal” interpretation. More
concretely, the tool analyzes the time-course data set of each observable variable using a concise
and semantically well-founded temporal logic language. The Simpathica/XSSYS
system can use data from a variety of sources, e.g. the NYU MAD and NYUSIM
databases (RAC+01), various BioSpice modules (BOS), PLAS files (V00), and simple
CSV text files.

Temporal  Logic  has  been  studied  in  depth  in  the  context  of  systems  whose  behavior 
changes  in  time,  for  instance,  computer  hardware,  network  protocols  and  engineering 
systems. We omit a detailed introduction to the many specific temporal logics
that have been introduced in the past. Instead we concentrate on the main ideas at the
core of these logics, in order to provide the intuition about how they can be used in the
analysis of biochemical systems.

Fundamental to a temporal logic is the notion that time-dependent terms from natural
language, such as “sometimes”, “eventually” and “always”, can be given a precise
meaning (semantics) in terms of the abstract behavior of a system under discourse. As an
example, consider the following sentence:

The concentration of guanosine triphosphate (GTP) is equal to k.

Such  a  sentence  is  true  only  in  certain  circumstances.  Given  a  biological  system  in 
equilibrium  the  above  sentence  may  or  may  not  be  true  at  any  or  all  instants  of  time.  In 
particular,  we  can  easily  construct  sentences  (in  a  suitable  natural  language)  that  express 
the  fact  that,  given  a  certain  set  of  initial  conditions  the  above  sentence  will  eventually 
hold  true.  Temporal  Logic  precisely  formalizes  the  meaning  of  the  adverb  eventually 
(and  other  such  “modes”:  always,  infinitely  often  and  almost  always)  and  the  resulting 
semantics  lead  to  a  precise  model-checking  algorithm  for  determining  the  validity  of  TL 
sentences  in  the  context  of  an  automaton. 

This  particular  attribute  of  TL  is  very  important  as  it  concisely  captures  the  notion  of  a 
logical  property  like  “steady-state”  and  formalizes  this  notion  in  a  simple  consistent  way 
that  is  directly  handled  by  the  model-checking  algorithm. 

Consider a system M and a (simulation) trace trace(M). If we consider a state s in
trace(M), we can simply check whether all the first derivatives in s are 0. Suppose we have a
procedure that answers yes (or no) when this is (or is not) the case. Let us call this predicate
zero_derivative. Suppose that there actually is a state s' in trace(M) where
zero_derivative yields yes. Now, by the rules of Temporal Logic, the following statement
would be true

Eventually(zero_derivative)

for each instant from the start, at least up until the instant characterized as state s'.

Now we can expand the language of Temporal Logic and introduce a new predicate
“steady state” as a synonym of the following concept: there exists an instant (a state s'
in trace(M)) after which zero_derivative will always be true. More formally,

steady_state(M) 

is  defined  to  be  logically  equivalent  to  the  following: 

Eventually(Always(zero_derivative))

meaning  that,  when  we  consider  the  simulation  (or  in  vivo)  trace  of  the  system  there  will 
be  a  time  where  all  the  rates  of  change  of  the  system's  variables  reach  0  and  remain  at  that 
value. 
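
As an illustration only, the definition of steady_state as Eventually(Always(zero_derivative)) can be sketched over a single finite, sampled trace. The sketch below is not the XSSYS implementation: the helper names, the finite-difference approximation of the derivative, the tolerance `tol`, and the sample values are all assumptions made for the example.

```python
from typing import Callable, Dict, Sequence

State = Dict[str, float]                      # variable name -> value at one time point
Predicate = Callable[[Sequence[State], int], bool]


def always(pred: Predicate, trace: Sequence[State], start: int = 0) -> bool:
    """pred holds at every sample from index `start` to the end of the trace."""
    return all(pred(trace, i) for i in range(start, len(trace)))


def eventually(pred: Predicate, trace: Sequence[State], start: int = 0) -> bool:
    """pred holds at some sample at or after index `start`."""
    return any(pred(trace, i) for i in range(start, len(trace)))


def zero_derivative(trace: Sequence[State], i: int, tol: float = 1e-9) -> bool:
    """Assumed approximation: no variable differs from its neighbouring sample by more than tol."""
    if len(trace) < 2:
        return True
    j = i - 1 if i > 0 else i + 1             # backward difference, forward at the first sample
    return all(abs(trace[i][v] - trace[j][v]) <= tol for v in trace[i])


def steady_state(trace: Sequence[State]) -> bool:
    """Eventually(Always(zero_derivative)) over a single finite trace."""
    return eventually(lambda t, i: always(zero_derivative, t, i), trace)


# Made-up sample values, for illustration only.
settling = [{"GTP": x} for x in (1.0, 0.5, 0.2, 0.2, 0.2)]
oscillating = [{"GTP": x} for x in (0.0, 1.0, 0.0, 1.0, 0.0)]
print(steady_state(settling), steady_state(oscillating))
```

On a finite trace, "always" can only mean "until the end of the recorded data", so a result of true says that the trace ends in a flat segment, not that the underlying system can never move again.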

Alternatively,  we  could  be  more  selective  and  ask  whether  some  specific  variable  reaches 
the  steady  state.  We  can  determine  the  answer  as  a  result  of  the  Definition  4. 

steady_state(M, GTP).

Another  set  of  properties  that  we  may  want  to  express  (and  subsequently  check)  is  the 
one  involving  “persistence.”  In  other  words,  properties  of  the  form:  something  is  always 
true  (or  false).  For  instance,  we  could  ask  whether  in  a  given  system 

Always(GTP > k).

Thus,  we  query  whether  the  GTP  level  always  remains  greater  than  k,  independent  of 
other  changes  occurring  during  the  evolution  of  the  system. 

The previous discussion illustrates the main ideas needed to translate an English sentence
involving temporal claims into a query in temporal logic. The translation from English to
TL is rather straightforward. Simple conjunctions (“and”s), disjunctions (“or”s) and
negations (“not”s) can be expressed directly.

Suppose  we  wish  to  determine  if  (1)  our  system  reaches  a  steady  state  and  (2)  the  level  of 
GTP  is  less  than  k  after  a  certain  instant.  This  statement  is  simply  expressed  in  TL  as 

steady_state and Eventually(Always(GTP < k)). (a)

Note that the validity of the above statement is completely determined by the two
constituent sub-expressions. Furthermore, determining the truth value of the statement requires
examining the entire system trace, since steady_state is a “global” property, and the
second  conjunct  has  the  same  form.  To  appreciate  the  subtleties  of  TL,  consider  the 
following  expression:  eventually  the  system  will  be  in  steady  state  and  the  level  of  GTP 
will  be  less  than  k. 

Eventually(steady_state and Always(GTP < k)). (b)

Given  the  properties  of  TL,  the  above  expression  (if  true)  will  actually  guarantee  that 
when  the  system  attains  the  steady  state,  it  also  has  a  GTP  level  less  than  k.  This  is  a 
different  statement  than  (a),  and  it  shows  how  flexible  and  yet  precise  a  TL  statement  can 
be,  without  sacrificing  a  high  degree  of  expressive  power. 

There are other built-in operators, such as conditionals, that describe the system or a variable
in a qualitative way. For example, the statement

Always(CDK1 > 3 * CDC25)
Implies Eventually(steady_state()).

returns true if it is the case that, whenever CDK1 is always more than 3 times CDC25, the system
eventually reaches steady state, that is, there is no net change in the values of the
quantities. Nested queries such as

Always(PRPP = 1.7 * PRPP1)
Implies
(steady_state()
and Eventually(Always(IMP < 2 * IMP1))
and Eventually(Always(hx_pool < 10 * hx_pool1))).

are just as simple for our tool to evaluate, though difficult for a human to understand at
first glance (the variables PRPP, PRPP1, IMP, IMP1, hx_pool, and hx_pool1
appear in the analysis of the purine metabolism pathway described in (APUM03)).

In (APP+03) we discuss some of the mathematical and computational problems
associated with this approach, e.g. the dependency of the analysis on the density of time
points. The Simpathica/XSSYS system essentially implements a model-checking
algorithm (CGP99) based on a “labeling” of each state, i.e., of each time-indexed time
point. The labeling of states enables the Simpathica/XSSYS Trace Analysis Tool to use
temporal logic to query complex logical dependencies of the variables on one another,
also using some specialized “verbs” whose meaning should be more intuitive for a
biologist.

For example, the query

Eventually(growing(CDK1)) and Always(CYCB > CDC25).

would evaluate to true if, within the data set, CDK1 eventually starts increasing and the CYCB
concentration always remains greater than that of CDC25. If the query is false over the
trace, the system would indicate the time at which it first violates the condition.
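
The behavior just described, a truth value plus the time of the first violation, can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's labeling algorithm: the function names, the reading of “growing” as "the next sample is strictly larger", the use of a "TICK" key for the time stamp, and the sample values are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Sample = Dict[str, float]  # includes a time stamp under the key "TICK" (assumption)


def check_always(trace: List[Sample],
                 cond: Callable[[Sample], bool]) -> Tuple[bool, Optional[float]]:
    """Return (True, None) if cond holds at every sample, else (False, time of first violation)."""
    for sample in trace:
        if not cond(sample):
            return False, sample["TICK"]
    return True, None


def growing(trace: List[Sample], i: int, var: str) -> bool:
    """Assumed reading of the 'growing' verb: the next sample is strictly larger."""
    return i + 1 < len(trace) and trace[i + 1][var] > trace[i][var]


def evaluate(trace: List[Sample]) -> Tuple[bool, Optional[float]]:
    """Eventually(growing(CDK1)) and Always(CYCB > CDC25)."""
    grows = any(growing(trace, i, "CDK1") for i in range(len(trace)))
    holds, first_bad = check_always(trace, lambda s: s["CYCB"] > s["CDC25"])
    return grows and holds, first_bad


def make_trace(cdk1, cycb, cdc25) -> List[Sample]:
    """Build a trace with one sample per unit of time (made-up data for the example)."""
    return [{"TICK": float(t), "CDK1": a, "CYCB": b, "CDC25": c}
            for t, (a, b, c) in enumerate(zip(cdk1, cycb, cdc25))]


good = make_trace([1, 1, 2, 3], [5, 5, 5, 5], [1, 2, 3, 4])
bad = make_trace([1, 1, 2, 3], [5, 5, 2, 5], [1, 2, 3, 4])
print(evaluate(good))
print(evaluate(bad))
```

In the second trace CYCB falls below CDC25 at the third sample, so the sketch reports that sample's time stamp as the first violation.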

Query Maker - A Natural Language Interface

Although the Simpathica/XSSYS system is very powerful and effective, it is not very
accessible to users without experience with temporal logic, an admittedly complex
and esoteric mathematical tool for the layperson. Therefore, we decided to wrap the
temporal logic system with a natural language interface to make the system more
accessible. Of course, several other systems have approached similar problems by
providing a natural language interface to a computational tool. For example, pioneering work at
Edinburgh University on natural language in the context of model checking for hardware
verification showed that a subset of English is sufficient to express temporal logic queries
(HK99). We adapted the approach to our biological setting by building a specialized set
of “verbs” immediately recognized by a biologist (e.g. “growing”, “steady state”, “flat”)
and then tying it to our specialized data analysis tool. All in all, we assumed that “if a
question cannot be asked in English, it will not be asked by a biologist.” The Query
Maker natural language interface is designed with this principle in mind.

The interface is built on top of a simple, context-free semantic parser (N92). Figure 2
shows a screenshot of the system. The questions are first parsed, and their
semantics are interpreted following a set of grammar rules. Then the questions are translated
into temporal logic queries, which are fed into the temporal logic system. Finally,
the temporal logic queries are partially compiled with a “Just-In-Time” compiler that
produces machine code for them. The system runs under Windows, Mac OS X and Linux,
and it also has a Web-based interface at the address
http://bioinformatics.nyu.edu:3000/home/lasp.

For example, if a biologist asks

“Is it eventually the case that if var1 is always between var2 and var3 and var4 is
always constant, then v5 will always be bounded by v3?”

the question will be translated to

Eventually(Always(var1 > var2 and var1 < var3)
and Always(flat(v4)))
Implies Always(v3 < v5).

Even though Query Maker has many limitations, because of its small vocabulary and the
fact that not all temporal logic queries can be expressed clearly in plain English, we can
see that it is already able to formulate and manipulate relatively complex queries. Our
hope is that after repeated usage, biologists will be able to formulate their own
temporal logic queries of the desired complexity.
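
To make the question-to-query step concrete, here is a deliberately tiny sketch of translating a few fixed English question shapes into temporal logic strings. It is only an assumption-laden toy: Query Maker uses a context-free semantic parser with grammar rules, whereas this sketch uses two regular-expression patterns, and the `translate` function, its pattern set, and the choice to upper-case variable names are all hypothetical.

```python
import re
from typing import Optional

# Hypothetical mini-vocabulary; the real grammar is much richer.
COMPARATORS = {"greater than": ">", "less than": "<", "equal to": "="}
MODES = {"always": "Always", "eventually": "Eventually"}

COMPARISON = re.compile(
    r"is (\w+) (always|eventually) (greater than|less than|equal to) (\w+)"
)


def translate(question: str) -> Optional[str]:
    """Map a question to a TL query string, or None if no pattern applies."""
    q = question.strip().rstrip("?").strip().lower()
    if q == "will the system eventually reach steady state":
        return "Eventually(steady_state())"
    m = COMPARISON.fullmatch(q)
    if m:
        left, mode, comparator, right = m.groups()
        return f"{MODES[mode]}({left.upper()} {COMPARATORS[comparator]} {right.upper()})"
    return None  # outside the toy vocabulary


print(translate("Will the system eventually reach steady state?"))
print(translate("Is CYCB always greater than CDC25?"))
```

Returning None for unrecognized questions mirrors the limitation noted above: a small vocabulary means some questions simply cannot be expressed.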

Example:  the  Yeast  Cell  Cycle 

The cell cycle is the sequence of repeating events through which a cell grows, replicates
its genetic material, and finally, physically separates into two daughter cells. It is a
tightly controlled process divided into the G1, S, G2, and M phases, corresponding to
growth, duplication of genetic material, and finally mitosis. The control mechanisms of
the budding yeast cell cycle can be accurately modeled, as in Novak and Tyson (NT97,
NT01). We will query the traces of the wild-type model as well as those of a mutant that lacks a
particular control mechanism (SK-knockout mutant).

It is known from various published analyses, e.g. (NT97, NT01), that elimination of the
SK control in the G1 phase causes CKIt (Cyclin-dependent Kinase Inhibitor) levels to
remain high, disrupting the cycling of the events. As a result, the mutant system reaches
a premature steady state, while the wild-type continues oscillating through the cell cycle.
In other words, the question

“Will the system eventually reach steady state?”

will yield a true answer for the mutant case, and a false answer for the wild-type.

It is also known that in wild-type yeast, when the CKIt level drops below CycBt, active
Cyclin B begins to form and activates a cascade of events that propels the cell to divide.
In the mutant, since CKIt levels do not drop due to the absence of SK, the Cyclin B level
remains low. Therefore, the question

“After  0.1  minutes,  when  CKIt  is  less  than  or  equal  to  CycBt,  does  CycBt  increase?” 

will yield a true answer for the wild-type, and a false answer for the mutant case. In
the mutant case the system answers with

The formula

Eventually((TICK > 0.1)
and AU(not(CKIT <= CYCBT)
and not(GROWING(CYCBT))
UNTIL
(CKIT <= CYCBT
and GROWING(CYCBT))))

is false over the trace.

I.e. the formula is false in the mutant case. Note the “internal” variable TICK, which
represents time.
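
The UNTIL operator in the formula above can be illustrated on a single finite trace, where the path quantifier of AU has only one path to range over. The sketch below assumes a "strong until" reading (the right-hand condition must actually occur), an assumed definition of GROWING as "the next sample is strictly larger", and made-up sample values that only mimic the qualitative wild-type and mutant behaviors; none of this is taken from the XSSYS code or from the Novak and Tyson data.

```python
from typing import Callable, Dict, List

Sample = Dict[str, float]
Pred = Callable[[List[Sample], int], bool]


def growing(trace: List[Sample], i: int, var: str) -> bool:
    """Assumed meaning of GROWING: the next sample is strictly larger."""
    return i + 1 < len(trace) and trace[i + 1][var] > trace[i][var]


def until(p: Pred, q: Pred, trace: List[Sample], start: int) -> bool:
    """Strong until: q holds at some sample >= start, and p holds at every sample before it."""
    for i in range(start, len(trace)):
        if q(trace, i):
            return True
        if not p(trace, i):
            return False
    return False  # q never occurred within the recorded trace


def p(trace: List[Sample], i: int) -> bool:
    s = trace[i]
    return not (s["CKIT"] <= s["CYCBT"]) and not growing(trace, i, "CYCBT")


def q(trace: List[Sample], i: int) -> bool:
    s = trace[i]
    return s["CKIT"] <= s["CYCBT"] and growing(trace, i, "CYCBT")


def query(trace: List[Sample]) -> bool:
    """Eventually((TICK > 0.1) and (p UNTIL q)) on one trace."""
    return any(trace[i]["TICK"] > 0.1 and until(p, q, trace, i)
               for i in range(len(trace)))


def make_trace(ckit, cycbt) -> List[Sample]:
    ticks = [0.0, 0.2, 0.4, 0.6, 0.8]
    return [{"TICK": t, "CKIT": a, "CYCBT": b} for t, a, b in zip(ticks, ckit, cycbt)]


# Illustrative, invented values only.
wild_like = make_trace([1.0, 0.8, 0.4, 0.2, 0.1], [0.5, 0.5, 0.5, 0.7, 0.9])
mutant_like = make_trace([1.0, 1.0, 1.0, 1.0, 1.0], [0.2, 0.2, 0.2, 0.2, 0.2])
print(query(wild_like), query(mutant_like))
```

In the mutant-like trace CKIT never drops to or below CYCBT, so the right-hand side of the UNTIL is never reached and the query is false, matching the answer reported for the mutant case.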


Integrated  and  Web-based  User  Interfaces 

We have built two user interfaces for the Query Maker subsystem of XSSYS: an
integrated one for the stand-alone application, shown in Figure 2, and a Web-based one.

The integrated interface allows a user to formulate questions and check answers while
being able to access all the functionalities of the XSSYS system. We also provide a
simple “Help” facility that explains graphically the meaning of each temporal
logic operator, and that explains how to formulate questions that use them.


Figure 2. A screenshot of the Simpathica/XSSYS Natural Language Interface. The “Query Maker”
window is used to type in English queries. A Help System showing the intuitive meaning of typical queries
can also be consulted to facilitate the expression of the Temporal Logic queries.

The Web-based interface exposes some of the simpler functionalities of the XSSYS
application. It is organized in three pages: (a) a “dataset selection”
page, (b) a “query” page, and (c) a “results” page. The three pages are shown in Figure 3,
Figure 4, and Figure 5. The dataset chooser connects to our NYUSIM database, which is
a repository of simulation traces. Simpathica and XSSYS write and read this database,
thus making it possible to keep a well-ordered list of datasets along with the
meta-data needed for identification and explanation.


[Screenshot of the “Query Maker Online” opening page (NYU Bioinformatics Group): an introductory
description of the XSSYS/Simpathica Trace Analysis System and a “Select Experiment” list containing
BioSpice Demo, BioSpice Demo II, EXX, Ian's Experiment, Joe, Nadia's Experiment, repressilator, and
Test Experiment 1.]


Figure 3. This is the opening page of the “Query Maker Online” (QMO) system viewable at
http://bioinformatics.nyu.edu:3001/Projects/qmo/lasp/home. The system shows a list of the “experiments”
for which the NYUSIM database has datasets visible to the general public.


[Screenshot of the QMO query page for “Ian's Experiment”: a list of variables (including CycB, Cdh1,
Cdc20T, Cdc20A, Mad, IEP, CycBt, CKIT, and SK), a dataset selector (Tyson Yeast Dataset Mut1, Tyson
Yeast Dataset mut2, Tyson Yeast Dataset WT, Tyson Yeast Simulation), and a text area containing the
query “Will the system eventually reach steady state?”.]

Figure 4. Once an experiment has been selected, QMO shows a page with a list of the “variables”
appearing in each of the datasets from the experiment. The user can enter a query involving those variables
in the text area on the right.


[Screenshot of the QMO results page for “Ian's Experiment”: Your query "Will the System eventually
reach steady state" on dataset "Tyson Yeast Dataset WT" is false.]

Figure 5. After accessing the NYUSIM database, loading the data on the QMO server and performing the
query analysis, the final page shows the result.

Sentence  Generation  of  “Biologically  Interesting  Factoids” 

At its core the XSSYS system manipulates a set of CTL temporal logic formulae. Each
formula is easily translated into a natural language (English). Given this feature, we
explored the possibility of automatically generating temporal logic formulae in
increasing order of complexity (i.e. formula length), with the intent of discovering new
facts about a dataset and producing a “biologically interesting factoid” story for the
consumption of a biologically savvy reader. That is, we have set up a traditional
generate-and-test framework. In the following we describe the generation algorithm and discuss
some of its features.

The generation algorithm must use several heuristics to constrain the size of the set F of
formulae to be analyzed, as a simple counting argument on the structure of the concrete
syntax of CTL formulae reveals that the number of formulae of “syntactic depth” d is
obviously too large to consider, even for the simple case of d = 3.

Given a number of relatively straightforward heuristics, the formula generation and
testing procedure can be kept under control, although the worst-case scenario still applies.
The heuristics involved are based on an (arbitrary) lexicographic ordering of the variables,
and on an accounting of the symmetries in the binary operators of the underlying
temporal logic language. Also, user-supplied ranges for the values of the variables
involved are taken into account. In essence the procedure performs the following steps:

Procedure Formula Generation:

1. Input: a set Vs of variables from an experiment; the elements of Vs are the
“story variables.”

2. For each formula template from the set:

a. Represses(P1, P2).

b. Activates(P1, P2).

c. Steady_state().

d. Constant(P, t1, t2).

e. Formulae representing the response of the system to a particular input
at time t (e.g. an impulse or a sustained input).

generate the set of all possible combinations of instantiated formulae using
only the elements of Vs.

Because of the set of heuristics used, the resulting set of formulae has limited size. Once
the set of formulae F has been generated, we can check each of its members against
the datasets comprised in an experiment. Figure 6 shows the overall architecture of the
“story generation” system. The result will be a set of valuations for each f ∈ F with
respect to each dataset, e.g. a dataset corresponding to a wild-type and one corresponding
to a mutant, as in the Yeast Cell Cycle example given before.
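
The generate-and-test loop described in this section can be sketched as follows. This is a hedged illustration rather than the actual generator: it instantiates only three of the templates, builds only one level of conjunctions, and the function names and the `check` callback (standing in for the XSSYS model checker) are hypothetical. It does show the two heuristics mentioned above: a fixed lexicographic ordering of the story variables, and skipping mirror-image duplicates of a symmetric binary operator.

```python
from itertools import combinations, permutations
from typing import Callable, Dict, Iterable, List


def generate_atoms(story_vars: Iterable[str]) -> List[str]:
    """Instantiate a few templates using only the story variables."""
    ordered = sorted(story_vars)               # arbitrary but fixed lexicographic order
    atoms = ["Steady_state()"]
    for p1, p2 in permutations(ordered, 2):    # ordered pairs: Represses(A, B) != Represses(B, A)
        atoms.append(f"Represses({p1}, {p2})")
        atoms.append(f"Activates({p1}, {p2})")
    return atoms


def generate_formulae(story_vars: Iterable[str]) -> List[str]:
    """Atoms plus one level of conjunctions, without symmetric duplicates."""
    atoms = generate_atoms(story_vars)
    formulae = list(atoms)
    for a, b in combinations(atoms, 2):        # unordered pairs: "a and b" equals "b and a"
        formulae.append(f"({a}) and ({b})")
    return formulae


def evaluate_all(formulae: List[str],
                 datasets: Dict[str, object],
                 check: Callable[[str, object], bool]) -> Dict[str, Dict[str, bool]]:
    """Valuation of every formula against every dataset; `check` is a stand-in model checker."""
    return {f: {name: check(f, data) for name, data in datasets.items()}
            for f in formulae}


formulae = generate_formulae({"CYCB", "CDH1"})
print(len(formulae))
print(formulae[:3])
```

With two story variables the sketch yields 5 atomic formulae and 10 unordered conjunctions; using ordered pairs for the conjunctions would add 10 redundant mirror images, which is the kind of saving the symmetry heuristic provides.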


Figure 6. The architecture of the “biologically
interesting factoid” generation system. Given a
number of datasets (logically belonging to a given
“experiment”), the system generates a set of CTL
formulae using a number of carefully chosen
heuristics (to constrain the number of formulae being
generated). Each formula is fed to the temporal logic
analyzer XSSYS, which is essentially a restricted
model-checker, and the results of such analysis are
then fed to the Natural Language Generation system,
which finally produces an HTML formatted file.


Given  a  number  of  datasets  and  a  set  of  “interesting”  values  for  the  variables  in  Vs,  the 
“factoid  generation”  system  produces  an  HTML  formatted  output.  Table  1  shows  an 
excerpt  from  the  output  produced  by  analyzing  three  datasets  obtained  by  simulation  of 
the Yeast Cell Cycle models described in (NT97).

Report on "Test Experiment Tyson WT, 1 Mutant, 2 Mutants.".
RESULTS

The results refer to the following datasets:

The first dataset is named "Ian's Experiment/Tyson Yeast Dataset WT".
The second dataset is named "Ian's Experiment/Tyson Yeast Dataset Mut1".
The third dataset is named "Ian's Experiment/Tyson Yeast Dataset mut2".


84. CDH1 less than or equal to 1.0071783 will always hold until CDH1 activates CYCB, is true
in the first dataset, is true in the second dataset, and is false in the third dataset.

85. CDH1 represses CYCB implies CYCB is greater than or equal to 0.65, is false in the first
dataset, is true in the second dataset, and is true in the third dataset.

86. CDH1 greater than or equal to 1.0071783 will always hold until CDH1 activates CYCB, is
false in the first dataset, is true in the second dataset, and is true in the third dataset.

87. eventually, CDH1 is less than or equal to CYCB, is false in the first dataset, is true in the
second dataset, and is true in the third dataset.


Table 1. A fragment of the “biologically interesting factoid” story produced by the generation system. The
system actually produced 234 such sentences involving the species CDH1 and CYCB and a number of
“interesting values” they can assume. These sentences can be more readily looked up by a biologist and
possibly indexed for better retrieval.
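
The sentence shape visible in Table 1 (an English rendering of a formula followed by its truth value in each dataset) can be reproduced with a few lines of string assembly. The sketch below is an assumption about that final formatting step only; the `render` function and its ordinal list are hypothetical and say nothing about how the actual Natural Language Generation system is built.

```python
from typing import List

ORDINALS = ["first", "second", "third", "fourth", "fifth"]


def render(sentence: str, valuations: List[bool]) -> str:
    """Append 'is true/false in the N-th dataset' clauses to an English formula rendering."""
    parts = [f"is {'true' if value else 'false'} in the {ORDINALS[i]} dataset"
             for i, value in enumerate(valuations)]
    if len(parts) == 1:
        tail = parts[0]
    else:
        tail = ", ".join(parts[:-1]) + ", and " + parts[-1]
    return f"{sentence}, {tail}."


print(render("eventually, CDH1 is less than or equal to CYCB", [False, True, True]))
```

Applied to the valuations of sentence 87, the sketch reproduces the wording shown in the table.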


Concluding  Remarks 

We have presented a simple natural language interface for a time-course data analysis
tool that tackles the problem of making a mathematically sophisticated system more
accessible to a biologist with little mathematical training. We have also presented an
initial system that is capable of generating an English rendition of a long list of simple
facts about a given biological system for which we have a simulatable model or an
experimental “trace”. While our systems rely on a large body of literature and experience,
they also represent a novel integration of a wide array of techniques to solve a general
problem facing bioinformaticists and biologists. Our implementations can obviously be
improved in several ways. For instance, we are working closely with biologists to expand
the set of predefined “verbs” and the grammar rules to account for more elaborate
questions. Moreover, we are also taking into account more sophisticated temporal logic
formulations that will allow more expressive questions to be handled.